Abstract

Primitives are presented that are designed to help users efficiently program
irregular problems (e.g. unstructured mesh sweeps, sparse matrix codes, adaptive
mesh partial differential equations solvers) on distributed memory machines.
These primitives are also designed for use in compilers for distributed
memory multiprocessors. Communications patterns are captured at runtime,
and the appropriate send and receive messages are automatically generated.

1 Did Somebody Say PARTI?

1.1  Overview 

PARTI stands for "Parallel Automated Runtime Toolkit at ICASE." Development
of PARTI has been carried out at Yale University as well as ICASE and hence has
been referred to as "PARTY" in some earlier papers. The PARTI runtime primitives
are designed to help users efficiently program loops found in irregular problems
(e.g. unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial
differential equations solvers). These primitives are also designed for use in
compilers for distributed memory multiprocessors. In the context of the PARTI
project, we are also developing a variety of other tools including compilers for
distributed machines. These primitives are some of the basic building blocks we
are using in our efforts.

The  primitives  in  this  distribution  run  on  any  of  the  iPSC/2  or  iPSC/860  machines 
produced  by  Intel  Scientific  Computing.  They  could  easily  be  modified  to  run  on  most 
distributed  memory  machines.  This  document  describes  the  operation  of  the  PARTI 
primitives  and  gives  several  examples  of  how  to  use  them.  The  rationale  of  the  PARTI 
system  (the  PARTI  line,  as  it  were)  was  presented  in  [2]  and  summarized  in  [4]. 
The  mechanisms  incorporated  in  these  primitives  have  been  outlined  in  [2],  [5], 
[4].  PARTI  has  been  used  in  a  variety  of  applications,  including  sparse  matrix  linear 
solvers,  adaptive  computational  fluid  dynamics  codes,  and  in  a  prototype  compiler 
[4]  aimed  at  distributed  memory  multiprocessors. 

1.2  Primitives  Available  in  the  Release 

The PARTI system is divided into several levels. Level 0 primitives allow
processors to access the distributed memory of a multiprocessor with a modicum of
convenience. Level 1 primitives bind mapping information to arrays. This allows the
user to store and manipulate constructs that describe multiprocessor mappings of
distributed multidimensional arrays. Included with this distribution are the level 0
primitives outlined next.

The  level  0  scatter  allows  each  processor  of  a  distributed  memory  machine  to  move 
data  to  off-processor  memory  locations.  The  level  0  gather  allows  each  processor  to 
obtain  copies  of  data  from  memory  locations  in  other  processors.  Level  0  primitives 
are  provided  to  support  initialization  and  access  of  distributed  translation  tables. 
Such  distributed  tables  allow  a  user  to  assign  globally  numbered  indices  to  processors 
in an irregular pattern. By using a distributed translation table, it is possible to avoid
replicating records of where distributed array elements are stored in all processors.
Level  0  primitives  also  carry  out  off-processor  accumulations;  e.g.  any  processor  can 
add  to  the  contents  of  an  off-processor  memory  location. 

1.3  Primitives  that  exist  but  are  not  yet  distributed 

There  are  additional  level  0  primitives  not  included  with  this  release  that  support  local 
caching  of  copies  of  off-processor  data.  These  Level  0  primitives  are  presented  in  [3] 
and  will  be  available  in  future  PARTI  releases.  Level  1  primitives,  also  not  available 
with  this  release,  allow  users  to  specify  how  distributed  arrays  are  to  be  mapped 
onto  sets  of  processors.  The  level  1  primitives  support  read,  write  and  accumulate 
accesses  to  these  mapped  multidimensional  arrays.  The  level  1  primitives  also  allow 
users  to  dynamically  remap  distributed  arrays.  The  Level  1  primitives  are  described 
in [1]. Use of the PARTI primitives does not interfere with access
to traditional message passing communications primitives. In particular, a user can
call all of the iPSC supplied routines when using PARTI.

2  Installation 

2.1  Getting  PARTI 

PARTI is available either as several shar files or as one tar file. The tar file is in
general more convenient, but the shar files can be sent through the mail. PARTI can be
obtained by anonymous ftp from ra.cs.yale.edu, from netlib, or by contacting:

Raja  Das 
ICASE 

Mail  Stop  132C 

NASA  Langley  Research  Center 

Hampton,  Va  06511 

(804)  864-8004 

raja@icase.edu

If  you  have  the  PARTI  tar  file,  just  change  to  the  directory  where  you  wish  to  put 
the  PARTI  subdirectory  and  type: 

tar xof parti.tar

If you have the shar files, things are only mildly worse. You need the following
files: docs.shar, free.shar, matmult.shar, papers.shar, src.shar, tests.shar, unst.shar
and a makefile (called "makefile", oddly enough). Put these files in the directory
where you want the PARTI subdirectory and type:

make unshar

2.2  Building  PARTI 


Either of the above installation procedures should create the following directory
structure:


parti/docs              documentation in LaTeX, PostScript and plain text
parti/examples/matmult  sparse matrix multiplication, described in Section B
parti/examples/unst     sweep over unstructured mesh, described in Section A
parti/examples/free     a conjugate gradient linear equation solver (cg.c and
                        cg_host.c) not discussed in this documentation. (Free prize
                        included in every copy of PARTI!) Also included is simplex,
                        a simple example involving several of the primitives.
parti/papers            some of the relevant papers
parti/src               source for the PARTI primitives
parti/tests             test programs to verify correct installation

A  makefile  should  be  present  in  the  PARTI  directory.  At  the  beginning  of  this 
makefile  are  several  macros  to  be  modified  by  the  user. 

NFLAG This macro is passed to the C compiler and linker when compiling and/or
linking node programs. It should have one of the following values:

-node -sx    for iPSC/2 machines with Weitek floating point accelerators
-node -i860  for iPSC/860 machines
-node        for vanilla iPSC/2 machines

NARC This macro indicates the archiver to be used in creating the PARTI library.
It should be set to one of the following:

ar     for any iPSC/2
ar860  for an iPSC/860

LIB This macro should be set to the directory where the PARTI library will be
installed. It is prudent to use the full path name here. This directory must exist
before the system is installed.

INCL This macro should be set to the directory where the PARTI include files will
reside. It is prudent to use the full path name here. This directory must exist
before the system is installed.

NPROCS This indicates the largest number of processors that the tests should be
run on. Eight and sixteen are good values.

NODECC This macro should be set to the C compiler which will compile the node
programs. The default compiler (cc) is always a correct choice. The pgcc
compiler may also be used where appropriate.

NODEF77 This macro should be set to the Fortran compiler to be used to compile
the node programs. The default compiler (f77) is always a correct choice. The
pgf77 compiler may be used where appropriate.

Make sure that the directories pointed to by LIB and INCL exist. If they do not, any
attempt to install the PARTI system there will fail. There are several targets to make.
Typing the following make commands in the listed order should be sufficient to install
and check the PARTI system on your computer.

make          will compile the PARTI library but not install it in the designated directories,
make install  will install the PARTI system in the designated directories,
make clean    will remove object and executable files from the various subdirectories,
make test     will run several tests to see if everything has been compiled correctly.
3 Function Descriptions

3.1 Header Files

There are two header files which go with the PARTI library. The first is parti.h. This
file contains the definitions of all structures, macro definitions and function definitions
needed to run the PARTI primitives. It must be included in all C programs that use
the PARTI system. The second include file, parti-more.h, is used only when the
system is compiled. It defines such things as message types and static buffer lengths.
It should not be necessary to include this file in applications which use PARTI. No
header files need be included in Fortran applications.

Two of the primitives, schedule and build_translation_table, are functions that
carry out preprocessing. schedule and build_translation_table allocate elements
of the structures schedule_struct and trans_table, respectively, and then return
pointers to those structures. The above structures are defined in parti.h; macro
definitions define struct schedule_struct as SCHED and define struct trans_table as
TTABLE. parti.h also defines the macros STRIPED and BLOCKED used in the procedure
build_translation_table.

3.2  Level  0  primitives 

Level 0 gathers and scatters are accomplished by using three routines: Scheduler,
Gather, and Scatter.

Scheduler on each processor is passed a list of indices K_j into aloc on each
processor j. Scheduler produces a schedule S that controls the data that are to be
fetched off-processor by Gather or scattered off-processor by Scatter.

On each processor, Gather inputs

1. a buffer into which the fetched elements are to be placed

2. a pointer to local array aloc

3. the schedule S produced by Scheduler

In Fig. 1 we introduce a running example to illustrate Scheduler, Gather and
Scatter. In this example we have three processors; each processor is passed a set of
off-processor indices.

Gather executes sends and receives that fetch from processor j the appropriate
elements from the array aloc on processor j. Then it places these elements into
the user-supplied buffer. Fig. 2 continues the running example begun in Fig. 1. On
processor j the array aloc is initialized as aloc(i) = j * 100 + i for 1 <= i. We
depict the contents of buffer on each processor after Gather is executed.

Figure 1: Scheduler Example

Scheduler:

inputs list of indices on each processor
outputs a schedule S

E.g.

processor 1: (processor 2, index 5), (processor 3, index 7)
processor 2: (processor 1, indices 4, 5, 6), (processor 3, index 2)
processor 3: (processor 1, index 1), (processor 2, indices 1, 3, 4)


Scatter  is  passed 

1.  a  buffer  from  which  each  scattered  datum  is  to  be  obtained 

2. a pointer to local array aloc

3. the schedule S produced by Scheduler

Scatter executes sends and receives that put on processor j the appropriate elements
from the buffer. Then Scatter places these elements into the appropriate elements of
array aloc on processor j. Fig. 3 continues the running example. We assume that on
processor j we initialize buffer as buffer(i) = j * 100 + i for 1 <= i, and we initialize
aloc so that aloc(i) = 0. After Scatter executes, we depict, on each processor j,
the contents of aloc.
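
The running example of Figs. 1-3 can be checked with a short sequential emulation. The sketch below is not PARTI code: it keeps the local arrays of all three processors inside one C program, exchanges no messages, and uses hypothetical names (emulate_gather, emulate_scatter, refs, nrefs). It only illustrates which element ends up where.

```c
#include <assert.h>

#define NPROC  3  /* processors in the running example, numbered 1..3 */
#define NLOC   8  /* local array length; entry 0 unused (1-based)     */
#define MAXREF 4  /* largest number of references on one processor    */

/* One off-processor reference: owning processor and local index. */
struct ref { int proc; int index; };

/* Reference lists of Fig. 1; row p-1 belongs to processor p. */
static const struct ref refs[NPROC][MAXREF] = {
    { {2, 5}, {3, 7} },
    { {1, 4}, {1, 5}, {1, 6}, {3, 2} },
    { {1, 1}, {2, 1}, {2, 3}, {2, 4} }
};
static const int nrefs[NPROC] = { 2, 4, 4 };

/* Gather: buffer[p][k] receives a copy of the k-th element requested by p. */
void emulate_gather(int aloc[NPROC + 1][NLOC], int buffer[NPROC + 1][MAXREF + 1])
{
    for (int p = 1; p <= NPROC; p++) {
        for (int k = 1; k <= nrefs[p - 1]; k++) {
            struct ref r = refs[p - 1][k - 1];
            assert(r.proc >= 1 && r.proc <= NPROC && r.index < NLOC);
            buffer[p][k] = aloc[r.proc][r.index];
        }
    }
}

/* Scatter: the k-th buffer element of p overwrites its remote target. */
void emulate_scatter(int buffer[NPROC + 1][MAXREF + 1], int aloc[NPROC + 1][NLOC])
{
    for (int p = 1; p <= NPROC; p++) {
        for (int k = 1; k <= nrefs[p - 1]; k++) {
            struct ref r = refs[p - 1][k - 1];
            assert(r.proc >= 1 && r.proc <= NPROC && r.index < NLOC);
            aloc[r.proc][r.index] = buffer[p][k];
        }
    }
}
```

With aloc(i) = j * 100 + i on processor j, emulate_gather leaves 205 and 307 in the first two buffer entries of processor 1, in agreement with Fig. 2.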

3.2.1  Functioning  of  the  Scheduler,  Gather  and  Scatter 

Both Scatter and Gather have three stages: they permute data into buffers to be
sent, perform the needed communication, and then perform another permutation.

Figure 2: Gather Example

Gather:

inputs schedule S produced by Scheduler
inputs pointer to local array aloc from which gathered elements are to be fetched
outputs fetched elements placed in local array buffer

E.g. assume

processor 1: aloc(i) = 100 + i, 1 <= i
processor 2: aloc(i) = 200 + i, 1 <= i
processor 3: aloc(i) = 300 + i, 1 <= i

Gather returns:

buffer   Processor 1   Processor 2   Processor 3
1        205           104           101
2        307           105           201
3        -             106           203
4        -             302           204

Figure 3: Scatter Example

Scatter:

inputs schedule S produced by Scheduler
inputs elements to be scattered, these are placed in local array buffer
outputs scattered elements, these are placed in local array aloc

E.g. assume

processor 1: buffer(i) = 100 + i, 1 <= i
processor 2: buffer(i) = 200 + i, 1 <= i
processor 3: buffer(i) = 300 + i, 1 <= i

processor 1: aloc(i) = 0, 1 <= i
processor 2: aloc(i) = 0, 1 <= i
processor 3: aloc(i) = 0, 1 <= i

After Scatter is called:

aloc   Processor 1   Processor 2   Processor 3
1      301           302           0
2      0             0             204
3      0             303           0
4      201           304           0
5      202           101           0
6      203           0             0
7      0             0             102

The scheduler first determines how many messages each processor must send and
receive during the data exchange phase. Defined on processor j is an array $nmsgs^j$.
Processor j sets the value of $nmsgs^j(i)$ to 1 if it needs data from processor i or to 0
if it does not. The scheduler then replaces $nmsgs^j$ with the element-by-element sum
$nmsgs^j(i) \leftarrow \sum_k nmsgs^k(i)$. This operation utilizes a function that imposes
a fan-in tree to find the sums. Since the resulting sum is kept in $nmsgs^j$ on every
processor, at the end of the fan-in $nmsgs^j(i)$ is the number of messages that processor i
must send during the exchange phase. Next, each processor sends a request list to every
other processor. The request list sent from processor p to processor q contains the
indices of data needed by processor p that are stored on processor q.

The  number  of  non-empty  request  lists  each  processor  will  receive  is  equal  to 
the  number  of  messages  that  the  processor  will  send  in  the  gather  or  scatter  phase. 
Each  request  list  is  placed  in  an  array  indexed  by  the  processor  from  which  the  list 
came.  When  the  scheduler  is  finished,  each  processor  has  an  array  of  request  lists 
obtained from other processors. The jth element of this array contains the request
list obtained from processor j. At this point in the execution, each processor i knows
which elements of aloc local to processor i must be sent to other processors.
This information is used to generate the schedule S of pairs of send and receive
statements. These send/receive pairs will exchange the requested data for either a
gather or a scatter. The gather or the scatter is passed the schedule S along with the
required buffer space. It then carries out the required communication.
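
The message-counting step can be sketched sequentially as follows. This is only an illustration under stated assumptions: the function name count_sends and the bound MAXP are hypothetical, all rows of the need matrix live in one address space, and a plain double loop stands in for the fan-in tree that the scheduler actually uses.

```c
#include <assert.h>

#define MAXP 16  /* hypothetical upper bound on processors for this sketch */

/*
 * needs[j][i] is 1 when processor j needs data from processor i, else 0
 * (the initial contents of the nmsgs array on processor j). On return,
 * nsend[i] holds the column sum over all j, i.e. the number of messages
 * processor i must send during the exchange phase.
 */
void count_sends(int nprocs, int needs[][MAXP], int nsend[])
{
    assert(nprocs > 0 && nprocs <= MAXP);
    for (int i = 0; i < nprocs; i++) {
        nsend[i] = 0;
        for (int j = 0; j < nprocs; j++)
            nsend[i] += needs[j][i];
    }
}
```

For the three-processor example of Fig. 1 every processor needs data from both of the others, so each processor must send two messages.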


3.3  schedule() 

This procedure carries out the preprocessing needed for carrying out optimized gather
exchanger and scatter exchanger routines. Every processor must participate in this
procedure call. On each processor, schedule is passed a list of processors and local
indices from which a gather procedure on that processor can later obtain data (or to
which a scatter procedure on that processor can later write data). schedule returns
a pointer to a structure of type SCHED; this pointer is used in gather, scatter and
scatter_FUNC operations (Sections 3.4, 3.5, 3.6).


Synopsis 

SCHED *schedule(local,proc,ndata)

Parameter declarations

int *local  local index to be gathered from or scattered to
int *proc   processors to be gathered from or scattered to
int ndata   number of data involved in gather or scatter

Return value

Returns pointer to structure of type SCHED which can be used in PREFIXgather,
PREFIXscatter, PREFIXscatter_add, PREFIXscatter_sub and PREFIXscatter_mult.
Example 

Node 0 schedules a fetch of elements 1 and 2 from a (so far unspecified) array on
node 1; node 1 schedules a fetch of element 1 from an array on node 0 and element 0
from an array on node 1.


int local[2], proc[2], ndata;
SCHED *schedinfo;

if (mynode()==0) {
    proc[0] = 1;
    local[0] = 1;
    proc[1] = 1;
    local[1] = 2;
    ndata = 2;
}

if (mynode()==1) {
    proc[0] = 0;
    local[0] = 1;
    proc[1] = 1;
    local[1] = 0;
    ndata = 2;
}

schedinfo = schedule(local,proc,ndata);


3.4 PREFIXgather()

PREFIX can be d (double precision), i (integer), f (floating point) or c (character).
This procedure is the gather exchanger procedure described above and in [1].
PREFIXgather uses a schedule produced by a call to schedule; the schedule is passed to
PREFIXgather in structure SCHED schedinfo. Copies of data values obtained from
other processors are placed in memory pointed to by buffer. Also passed to
PREFIXgather is a pointer to the location from which data is to be fetched on the calling
processor. This pointer is designated here as aloc; aloc corresponds to the array
aloc described above and in [1].


Synopsis

void PREFIXgather(schedinfo,buffer,aloc)

Parameter Declarations

SCHED *schedinfo  information obtained from schedule's preprocessing of
                  reference pattern
TYPE *buffer      pointer to buffer for copies of gathered data values
TYPE *aloc        location from which data is to be fetched from calling processor

Return Value

None

Example

We assume that schedule has already been called with the parameters presented
in Section 3.3. Our example will assume that we wish to gather double precision
numbers, i.e. that we will be calling dgather. On each processor, *aloc points to
the array from which values are to be obtained, and *buffer points to the location
into which copies of data values obtained from other processors will be placed.


double buffer[2], aloc[3];
SCHED *schedinfo;
int i;

for (i=0; i<3; i++) {
    aloc[i] = mynode() + 0.1*i;
}

dgather(schedinfo,buffer,aloc);


On processor 0, buffer[0] and buffer[1] are now equal to 1.1 and 1.2. On processor
1, buffer[0] and buffer[1] are now equal to 0.1 and 1.0.


3.5 PREFIXscatter()

PREFIX can be d (double precision), i (integer), f (floating point) or c (character).
This procedure is the scatter exchanger procedure described above and in [1].
PREFIXscatter uses a schedule produced by a call to schedule; the schedule is passed to
PREFIXscatter in structure SCHED schedinfo. Copies of data values to be scattered
to other processors are placed in memory pointed to by buffer. Also passed to
PREFIXscatter is a pointer to the location to which copies of data are to be written
on the calling processor. This pointer is designated here as aloc; aloc corresponds to
the array aloc described above and in [1].


Synopsis

void PREFIXscatter(schedinfo,buffer,aloc)

Parameter Declarations

SCHED *schedinfo  information obtained from schedule's preprocessing of
                  reference pattern
TYPE *buffer      points to data values to be scattered from a given processor
TYPE *aloc        points to first memory location on calling processor for scattered
                  data

Return Value

None

Example

We assume that schedule has already been called with the parameters presented
in Section 3.3. Our example will assume that we wish to scatter double precision
numbers, i.e. that we will be calling dscatter. On each processor, *aloc points to
the array to which values are to be scattered, and *buffer points to the location from
which the data to be scattered will be obtained. The processor and local-array
index to which each value is to be scattered were designated during an earlier call
to schedule.


double buffer[2], aloc[3];
SCHED *schedinfo;
int i;

for (i=0; i<3; i++) {
    aloc[i] = 10.0;
}

if (mynode()==0) {
    buffer[0] = 444.44;
    buffer[1] = 555.55;
}

if (mynode()==1) {
    buffer[0] = 666.66;
    buffer[1] = 777.77;
}

dscatter(schedinfo,buffer,aloc);


On processor 0, the first three elements of aloc are 10.0, 666.66 and 10.0. On
processor 1, the first three elements of aloc are 777.77, 444.44 and 555.55.


3.6 PREFIXscatter_FUNC()

PREFIX can be d (double precision), i (integer), f (floating point) or c (character).
FUNC can be add, sub or mult. PREFIXscatter stores data values to specified
locations. PREFIXscatter_FUNC allows one processor to specify computations that
are to be performed on the contents of a given memory location of another processor.
The procedure is in other respects analogous to PREFIXscatter.


Synopsis

void PREFIXscatter_FUNC(schedinfo,buffer,aloc)

Parameter Declarations

SCHED *schedinfo  information obtained from schedule's preprocessing of
                  reference pattern.
TYPE *buffer      points to data values that will form operands for the specified
                  type of remote operation.
TYPE *aloc        points to first memory location on calling processor to be used as
                  targets of remote operations.

Return Value

None
Example

We assume that schedule has already been called with the parameters presented
in Section 3.3. Our example will assume that we wish to scatter and add double
precision numbers, i.e. that we will be calling dscatter_add. On each processor,
*aloc points to the array to which values are to be scattered and added, and *buffer
points to the location from which the values to be scattered and added will be
obtained. The processor and local-array index to which each value is to be scattered
and added were designated during an earlier call to schedule.


double buffer[2], aloc[3];
SCHED *schedinfo;
int i;

for (i=0; i<3; i++) {
    aloc[i] = 10.0;
}

if (mynode()==0) {
    buffer[0] = 444.44;
    buffer[1] = 555.55;
}

if (mynode()==1) {
    buffer[0] = 666.66;
    buffer[1] = 777.77;
}

dscatter_add(schedinfo,buffer,aloc);


On processor 0, the first three elements of aloc are 10.0, 676.66 and 10.0. On
processor 1, the first three elements of aloc are 787.77, 454.44 and 565.55.
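
The accumulated values quoted above can be reproduced with a sequential emulation of the two-node example. The sketch below is an assumption-laden illustration, not the library routine: emulate_scatter_add is a hypothetical name, both nodes' arrays are held in one program, and the proc/local arguments play the role of the lists passed to schedule in Section 3.3.

```c
#include <assert.h>

#define NNODE 2  /* nodes in the example of Section 3.3   */
#define NLOC  3  /* length of the local array on each node */
#define NDATA 2  /* values scattered by each node          */

/*
 * Accumulating scatter: element k of node p's buffer is added to
 * aloc[proc[p][k]][local[p][k]] instead of overwriting it.
 */
void emulate_scatter_add(int proc[NNODE][NDATA], int local[NNODE][NDATA],
                         double buffer[NNODE][NDATA], double aloc[NNODE][NLOC])
{
    for (int p = 0; p < NNODE; p++) {
        for (int k = 0; k < NDATA; k++) {
            assert(proc[p][k] >= 0 && proc[p][k] < NNODE);
            assert(local[p][k] >= 0 && local[p][k] < NLOC);
            aloc[proc[p][k]][local[p][k]] += buffer[p][k];
        }
    }
}
```

Replacing += by = in the innermost statement gives the behaviour of the plain scatter of Section 3.5.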


3.7  build_translation_table() 

In order to allow a user to assign globally numbered indices to processors in an
irregular pattern, it is useful to be able to define and access a distributed translation
table. By using a distributed translation table, it is possible to avoid replicating
records of where distributed array elements are stored in all processors. The
distributed table is itself partitioned in a very regular manner. A processor that seeks
to access an element I of an irregularly distributed data array is able to compute a
simple function that designates a location in the distributed table; the location of the
actual array element sought is obtained from the distributed table.

The procedure build_translation_table constructs a distributed translation table.
It assumes that distributed array elements are globally numbered. Each processor
passes build_translation_table a set of indices for which it will be responsible. The
distributed translation table may be striped or blocked across the processors. With
a striped translation table, the translation table entry for global index I is stored in
processor (I modulo number_of_processors); the local index of the translation table
entry is (I / number_of_processors). In a blocked translation table, translation table
entries are partitioned into a number of equal sized ranges of contiguous integers; these
ranges are placed in consecutively numbered processors. With blocked partitioning,
the block corresponding to index I is (I/B) and the local index is (I modulo B),
where B is the size of the block. Let M be the maximum global index passed to
build_translation_table by any processor and NP represent the number of processors;
B  =  [M/NP]. 
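
The two placement rules for table entries can be written down directly. The following is a minimal sketch with hypothetical names (struct table_slot, striped_slot, blocked_slot); it takes the block size as an argument rather than computing it, and it describes only where a table entry is stored, not where the array element itself lives.

```c
#include <assert.h>

/* Location of the translation table entry for one global index. */
struct table_slot { int proc; int local; };

/* Striped table: entries are dealt out round-robin over nprocs processors. */
struct table_slot striped_slot(int global, int nprocs)
{
    assert(global >= 0 && nprocs > 0);
    struct table_slot s = { global % nprocs, global / nprocs };
    return s;
}

/* Blocked table: each processor holds one run of block_size consecutive entries. */
struct table_slot blocked_slot(int global, int block_size)
{
    assert(global >= 0 && block_size > 0);
    struct table_slot s = { global / block_size, global % block_size };
    return s;
}
```

For instance, with four processors the striped entry for global index 7 is held by processor 3 at local position 1; with a block size of 3 the blocked entry for the same index is on processor 2 at local position 1.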

build_translation_table returns a pointer to a structure of type TTABLE; this
pointer is used in dereference, defined in Section 3.8.


Synopsis 

TTABLE *build_translation_table(part,indexarray,ndata)

Parameter Declarations

int part         how translation table will be mapped - may be BLOCKED or STRIPED
int *indexarray  each processor P specifies list of globally numbered indices for
                 which P will be responsible
int ndata        number of indices for which processor P will be responsible

Return Value

pointer to a structure of type TTABLE; this structure contains a given processor's
portion of the distributed translation table

Example

An example to demonstrate the use of both build_translation_table and dereference
can be found in Section 3.8.


3.8 dereference()

dereference accesses the distributed translation table constructed in
build_translation_table. dereference is passed a pointer to a structure of type TTABLE;
this structure defines the irregularly distributed mapping created in procedure
build_translation_table. dereference is passed an array with global indices that need
to be located in distributed memory; dereference returns arrays local and proc that
contain the processors and local indices corresponding to the global indices.


Synopsis

void dereference(index_table, global, local, proc, ndata)

Parameter declarations

TTABLE *index_table  distributed translation table data structure created in
                     build_translation_table
int *global          list of global indices we wish to locate in distributed memory
int *local           local indices obtained from the distributed translation table that
                     correspond to the global indices passed to dereference
int *proc            array of distributed translation table processor assignments for
                     each global index passed to dereference
int ndata            number of elements to be dereferenced

Return value

None

Example

A one dimensional distributed array is partitioned in some irregular manner so we
need a distributed translation table to keep track of where one can find the value
of a given element of the distributed array.

In the example below, we initialize a translation table. Processor 0 calls
build_translation_table and assigns indices 0 and 3 to processor 0; processor 1 calls
build_translation_table and assigns indices 1 and 2 to processor 1. The translation
table is partitioned between processors in blocks.

Processor 0 then uses the translation table to dereference global indices 0 and 1;
processor 1 uses the translation table to dereference global indices 2 and 3. On
each processor, dereference carries out a translation table lookup. The values of
proc and local returned by dereference are shown in Table 1. The user gets
to specify the processor to which each global index is assigned; note however that
build_translation_table assigns local indices.

Table 1: Values obtained by dereference

Processor  proc[0]  local[0]  proc[1]  local[1]
0          0        0         1        0
1          1        1         0        1


#include <stdio.h>
#include <stdlib.h>
#include "parti.h"

main()
{
    int size, i, *index_array;
    int *deref_array;
    int *local, *proc;
    TTABLE *table;

    size = 2;
    index_array = (int *) malloc(sizeof(int)*size);
    deref_array = (int *) malloc(sizeof(int)*size);
    local = (int *) malloc(sizeof(int)*size);
    proc = (int *) malloc(sizeof(int)*size);

    /* Assign indices 0 and 3 to processor 0 */
    if (mynode()==0)
    {
        index_array[0] = 0;
        index_array[1] = 3;
    }

    /* Assign indices 1 and 2 to processor 1 */
    if (mynode()==1)
    {
        index_array[0] = 1;
        index_array[1] = 2;
    }

    /* set up a translation table */
    table = build_translation_table(BLOCKED,index_array,size);

    /* Processor 0 seeks processor and local indices
       for global array indices 0 and 1 */
    if (mynode()==0)
    {
        deref_array[0] = 0;
        deref_array[1] = 1;
    }

    /* Processor 1 seeks processor and local indices
       for global array indices 2 and 3 */
    if (mynode()==1)
    {
        deref_array[0] = 2;
        deref_array[1] = 3;
    }

    /* Dereference a set of global variables */
    dereference(table,deref_array,local,proc,size);

    /* local and proc return the processors and local indices where
       global array indices are stored.

       In processor 0, proc[0] = 0, proc[1] = 1, local[0] = 0, local[1] = 0
       In processor 1, proc[0] = 1, proc[1] = 0, local[0] = 1, local[1] = 1
    */
}


Now assume that processor 0 needs to know the values of distributed array elements
0, 1 and 3, while processor 1 needs to know the value of element 2. We call
dereference to find the processors and the local indices that correspond to each global
index. At this point schedule can be called, and gathers and scatters can be carried out.
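
The lookup performed by dereference can be pictured with a small sequential model.
The sketch below is not part of PARTI: the names SimTable, sim_table_new,
sim_table_claim, sim_dereference and sim_table_free are hypothetical, the table lives
in one address space instead of being distributed, and it assumes that local indices
are handed out in the order in which a processor lists its global indices (which is
consistent with the example above).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sequential stand-in for a translation table: for every
   global index it records the owning processor and the local index. */
typedef struct {
    int  nglobal;
    int *proc;    /* proc[g]  = owner of global index g       */
    int *local;   /* local[g] = local index of g on its owner */
} SimTable;

/* Create an empty table for nglobal indices; returns NULL on failure. */
SimTable *sim_table_new(int nglobal)
{
    SimTable *t;
    int g;

    assert(nglobal > 0);
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->nglobal = nglobal;
    t->proc  = malloc(sizeof(int) * (size_t)nglobal);
    t->local = malloc(sizeof(int) * (size_t)nglobal);
    if (t->proc == NULL || t->local == NULL) {
        free(t->proc);
        free(t->local);
        free(t);
        return NULL;
    }
    for (g = 0; g < nglobal; g++) {
        t->proc[g]  = -1;   /* -1 means "not yet assigned" */
        t->local[g] = -1;
    }
    return t;
}

/* Record the global indices claimed by processor p (assumption: the
   k-th listed index receives local index k). */
void sim_table_claim(SimTable *t, int p, const int *index_array, int size)
{
    int k;

    for (k = 0; k < size; k++) {
        assert(index_array[k] >= 0 && index_array[k] < t->nglobal);
        t->proc[index_array[k]]  = p;
        t->local[index_array[k]] = k;
    }
}

/* Look up owner and local index for each requested global index. */
void sim_dereference(const SimTable *t, const int *global,
                     int *local, int *proc, int ndata)
{
    int k;

    for (k = 0; k < ndata; k++) {
        assert(global[k] >= 0 && global[k] < t->nglobal);
        local[k] = t->local[global[k]];
        proc[k]  = t->proc[global[k]];
    }
}

/* Release a table created by sim_table_new. */
void sim_table_free(SimTable *t)
{
    if (t != NULL) {
        free(t->proc);
        free(t->local);
        free(t);
    }
}
```

Claiming indices {0, 3} for processor 0 and {1, 2} for processor 1 and then looking
up 0, 1 and 2, 3 gives the same proc and local values as the comments at the end of
the listing above.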


3.9  localize() 

When loops access data residing off processor, some pre-processing is necessary before
these loops can be executed. The pre-processing involves setting up a schedule to bring
in the off-processor data, and changing all the global references to local ones. The
primitive localize makes calls to dereference and schedule to do all the necessary
processing. The schedule pointer returned by localize is used to gather data and
store it at the end of the local array. This schedule is created such that
multiple copies of the same data are not brought in during the gather phase. The
elimination of duplicates is achieved by using a hash table. Localize returns the
local reference string corresponding to the global references which are passed as a
parameter to it. The number of off-processor data elements is also returned by
localize so that one can allocate enough space at the end of the local array.
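
The renumbering step can be illustrated with a short sequential sketch. It is not the
PARTI implementation: sim_localize is a hypothetical name, the owner and local index of
every global index are passed in as plain arrays (standing in for the result of
dereference), no schedule is built, and off-processor elements are assumed to be
numbered my_size, my_size+1, ... in the order in which they are first referenced. A
small open-addressing hash table ensures that a repeated off-processor reference is
given the slot of its first occurrence.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the global-to-local renumbering.
   proc[g] and local[g] give the owner and local index of global index g.
   Returns the number of distinct off-processor elements referenced,
   or -1 if memory cannot be allocated. */
int sim_localize(const int *proc, const int *local,
                 const int *global_refs, int *local_refs, int ndata,
                 int my_proc, int my_size)
{
    int nslots = 2 * ndata + 1;   /* keeps the hash table under half full */
    int *key, *val;
    int i, n_off_proc = 0;

    assert(ndata >= 0 && my_size >= 0);
    key = malloc(sizeof(int) * (size_t)nslots);
    val = malloc(sizeof(int) * (size_t)nslots);
    if (key == NULL || val == NULL) {
        free(key);
        free(val);
        return -1;
    }
    for (i = 0; i < nslots; i++)
        key[i] = -1;              /* -1 marks an empty slot */

    for (i = 0; i < ndata; i++) {
        int g = global_refs[i];
        int h;

        assert(g >= 0);
        if (proc[g] == my_proc) {         /* on-processor reference */
            local_refs[i] = local[g];
            continue;
        }
        h = g % nslots;                   /* linear probing */
        while (key[h] != -1 && key[h] != g)
            h = (h + 1) % nslots;
        if (key[h] == -1) {               /* first time g is seen */
            key[h] = g;
            val[h] = my_size + n_off_proc;
            n_off_proc++;
        }
        local_refs[i] = val[h];           /* duplicates reuse the slot */
    }
    free(key);
    free(val);
    return n_off_proc;
}
```

With this numbering the caller needs my_size plus the returned count elements of
storage in the local array, which is the reason localize reports the number of
off-processor elements.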


Synopsis 

void localize(tabptr,lsched,global_refs,local_refs,ndata,n_off_proc,my_size)

Parameter  Declarations 

TTABLE *tabptr pointer to the distributed translation table built for the local
array being dealt with.

SCHED **lsched pointer to the data structure for the schedule, which stores all the
send and receive information (returned by localize).

int *global_refs pointer to the array which stores the global reference string.

int *local_refs pointer to the array which stores the local reference string
corresponding to the global references (returned by localize).

int ndata number of global references.

int *n_off_proc address of the number of off-processor data elements (returned by localize).

int my_size the size of my local array.

Return  Value 

None 

Example 

Nodes 0 and 1 take part in a computation which involves a loop that refers to
data residing off processor. The irregularly distributed arrays are x and y. Both
arrays have the same distribution pattern. Node 0 contains global indices 0, 1
and 2, while node 1 contains 3, 4, 5, 6 and 7. During the actual computation both
nodes 0 and 1 need to access certain elements of the y array. The global indices
that node 0 has to access are 3, 7 and 1, and node 1 has to access 4, 2, 3, 0 and 6.
We now present the inspector-executor code for the scenario described above.

#define BLOCKED 1

int i, ndata, indirection, n_off_proc;
int local[5], global_ref[5], local_ref[5];
double x[5], y[10];
TTABLE *tabptr;
SCHED *schedptr;

/* the following is the inspector code */

if (mynode() == 0) {
    local[0] = 0;
    local[1] = 1;
    local[2] = 2;
    ndata = 3;

    tabptr = build_translation_table(BLOCKED,local,ndata);

    global_ref[0] = 3;
    global_ref[1] = 7;
    global_ref[2] = 1;

    localize(tabptr,&schedptr,global_ref,
             local_ref,ndata,&n_off_proc,3);
} else {
    local[0] = 3;
    local[1] = 4;
    local[2] = 5;
    local[3] = 6;
    local[4] = 7;
    ndata = 5;

    tabptr = build_translation_table(BLOCKED,local,ndata);

    global_ref[0] = 4;
    global_ref[1] = 2;
    global_ref[2] = 3;
    global_ref[3] = 0;
    global_ref[4] = 6;

    localize(tabptr,&schedptr,global_ref,
             local_ref,ndata,&n_off_proc,5);
}

/* end of the inspector. Let us assign values to
   the distributed arrays */

for (i=0; i<ndata; i++) {
    x[i] = i;
    y[i] = 2*i;
}

/* the following is the executor code */

dgather(schedptr,&y[ndata],y);

for (i=0; i<ndata; i++) {
    indirection = local_ref[i];
    x[i] = x[i] + 3 * y[indirection];
}

/* end of the executor code */


After the end of the computation, on processor 0 the values of x[0], x[1] and x[2]
are 0.0, 25.0 and 8.0 respectively. On processor 1 the values of x[0], x[1], x[2], x[3]
and x[4] are 6.0, 13.0, 2.0, 3.0 and 22.0 respectively. For a detailed example in
FORTRAN refer to Appendix A.
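
The arithmetic behind these values can be checked with a sequential model of one
node's executor. This is only a sketch: sim_executor is a hypothetical name, the
gather is replaced by copying an array of already fetched values to the end of y, and
the local reference string is assumed to number off-processor elements in order of
first reference, starting at ndata.

```c
#include <assert.h>

/* Hypothetical model of the executor on a single node.
   y holds the ndata local elements followed by room for n_off_proc copies;
   fetched[] holds the off-processor values in the order in which they are
   stored after the local part. */
void sim_executor(double *x, double *y, int ndata,
                  const double *fetched, int n_off_proc,
                  const int *local_ref)
{
    int i;

    assert(ndata >= 0 && n_off_proc >= 0);

    /* stands in for dgather(schedptr, &y[ndata], y) */
    for (i = 0; i < n_off_proc; i++)
        y[ndata + i] = fetched[i];

    /* the loop from the executor code */
    for (i = 0; i < ndata; i++)
        x[i] = x[i] + 3 * y[local_ref[i]];
}
```

For node 0 the fetched values are those of global elements 3 and 7 (0.0 and 8.0 on
node 1) and the local references are 3, 4, 1; for node 1 the fetched values are those
of global elements 2 and 0 (4.0 and 0.0 on node 0) and the local references are
1, 5, 0, 6, 3. Under these assumptions the model reproduces the values quoted above.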


4  Calling  the  primitives  from  FORTRAN 

This section shows how the primitives can be used with FORTRAN. We will go
through the examples described in Section 3 using the FORTRAN version of the
PARTI primitives.


4.1 function ifschedule()

This function returns an integer which can be used to refer to the schedule
corresponding to the input data. This integer is used in gather, scatter and scatter_FUNC
operations (Sections 4.2, 4.3, 4.4).


Synopsis 

function ifschedule(ilocal,iproc,ndata)

Parameter  declarations 

integer ilocal() local indices to be gathered from or scattered to
integer iproc() processors to be gathered from or scattered to
integer ndata number of data elements involved in gather or scatter

Return  value 

Returns a reference to a schedule which can be used in PREFIXfgather,
PREFIXfscatter, PREFIXfscatter_add, PREFIXfscatter_sub and PREFIXfscatter_mult.

Example 

Node  0  schedules  a  fetch  of  elements  1  and  2  from  a  (so  far  unspecified)  array  on 
node  1;  node  1  schedules  a  fetch  of  element  1  from  an  array  on  node  0  and  3  from 
an  array  on  node  1. 


      integer ifschedule
      integer ilocal(2), iproc(2), ndata
      integer ischedinfo

      if (mynode().eq.0) then
         iproc(1) = 1
         ilocal(1) = 1
         iproc(2) = 1
         ilocal(2) = 2
         ndata = 2
      endif

      if (mynode().eq.1) then
         iproc(1) = 0
         ilocal(1) = 1
         iproc(2) = 1
         ilocal(2) = 3
         ndata = 2
      endif

      ischedinfo = ifschedule(ilocal,iproc,ndata)


4.2  subroutine  PREFIXfgather() 

PREFIX  can  be  d  (double  precision),  i  (integer)  ,  f  (real)  or  c  (character).  For  more 
information  refer  to  Section  3.4. 

Synopsis 

subroutine PREFIXfgather(ischedinfo,buffer,aloc)

Parameter Declarations

integer ischedinfo refers to the relevant schedule

TYPE buffer() buffer for copies of gathered data values

TYPE aloc() location from which data is to be fetched from calling processor

Return Value

None

Example

We assume that schedule has already been called with the parameters presented
in Section 4.1. Our example will assume that we wish to gather double precision
numbers, i.e. that we will be calling dfgather. On each processor, aloc is
the array from which values are to be obtained, and buffer is the location into
which copies of data values obtained from other processors will be placed.


      double precision buffer(2), aloc(3)
      integer ischedinfo

      do 10 i=1,3
         aloc(i) = mynode() + 0.1*i
 10   continue

      call dfgather(ischedinfo,buffer,aloc)


On  processor  0,  buffer(l)  and  buffer(2)  are  now  equal  to  1.1  and  1.2.  On  processor 
1,  buffer(l)  and  buffer(2)  are  now  equal  to  0.1  and  1.3. 
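
The data movement of this gather can be modelled sequentially in C. The sketch below
is illustrative only: sim_gather, SIM_NODES and SIM_LEN are hypothetical names, all
node arrays are held in one address space, and no messages are exchanged. Local
indices are 1-based to match the FORTRAN interface.

```c
#include <assert.h>

#define SIM_NODES 2   /* hypothetical: number of simulated processors  */
#define SIM_LEN   3   /* hypothetical: local array length on each node */

/* Sequential model of a gather: node p copies, for each of its ndata[p]
   requests, element ilocal (1-based) of the array on node iproc into
   its own buffer. */
void sim_gather(double buffer[SIM_NODES][SIM_LEN],
                double aloc[SIM_NODES][SIM_LEN],
                int iproc[SIM_NODES][SIM_LEN],
                int ilocal[SIM_NODES][SIM_LEN],
                const int ndata[SIM_NODES])
{
    int p, k;

    for (p = 0; p < SIM_NODES; p++) {
        assert(ndata[p] >= 0 && ndata[p] <= SIM_LEN);
        for (k = 0; k < ndata[p]; k++) {
            int src  = iproc[p][k];
            int elem = ilocal[p][k] - 1;   /* convert to a 0-based index */

            assert(src >= 0 && src < SIM_NODES);
            assert(elem >= 0 && elem < SIM_LEN);
            buffer[p][k] = aloc[src][elem];
        }
    }
}
```

Filling aloc on node p with p + 0.1*i and using the requests of Section 4.1 leaves
node 0's buffer holding elements 1 and 2 of node 1, and node 1's buffer holding
element 1 of node 0 and element 3 of node 1, in agreement with the values above.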


4.3 subroutine PREFIXfscatter()

PREFIX  can  be  d  (double  precision),  i  (integer)  ,  f  (real)  or  c  (character).  For  more 
information  refer  to  Section  3.5. 


Synopsis 

subroutine PREFIXfscatter(ischedinfo,buffer,aloc)

Parameter Declarations


integer  ischedinfo  refers  to  the  relevant  schedule. 

TYPE  buffer ()  points  to  data  values  to  be  scattered  from  a  given  processor 

TYPE  aloc()  points  to  first  memory  location  on  calling  processor  for  scattered 
data 

Return Value

None

Example

We assume that schedule has already been called with the parameters presented
in Section 4.1. Our example will assume that we wish to scatter double precision
numbers, i.e. that we will be calling dfscatter. On each processor, aloc is
the array to which values are to be scattered, and buffer is the location from
which the data to be scattered will be obtained. The processor and local array
index to which the values are to be scattered were designated during an earlier call
to schedule.


      double precision buffer(2), aloc(3)
      integer ischedinfo

      do 10 i=1,3
         aloc(i) = 10.0
 10   continue

      if (mynode().eq.0) then
         buffer(1) = 444.44
         buffer(2) = 555.55
      endif

      if (mynode().eq.1) then
         buffer(1) = 666.66
         buffer(2) = 777.77
      endif

      call dfscatter(ischedinfo,buffer,aloc)


On processor 0, the first three elements of aloc are 666.66, 10.0 and 10.0. On
processor 1, the first three elements of aloc are 444.44, 555.55 and 777.77.
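
A scatter moves data in the opposite direction to a gather, and a sequential model
makes the destinations explicit. As before this is only a sketch with hypothetical
names (sim_scatter, SIM_NODES, SIM_LEN), it uses one address space and no message
passing, and local indices are 1-based. The add flag is an assumption introduced here
so that the same routine can also illustrate an accumulating scatter.

```c
#include <assert.h>

#define SIM_NODES 2   /* hypothetical: number of simulated processors  */
#define SIM_LEN   3   /* hypothetical: local array length on each node */

/* Sequential model of a scatter: node p sends buffer[p][k] to element
   ilocal (1-based) of the array on node iproc. When add is nonzero the
   value is accumulated into the target instead of overwriting it. */
void sim_scatter(double buffer[SIM_NODES][SIM_LEN],
                 double aloc[SIM_NODES][SIM_LEN],
                 int iproc[SIM_NODES][SIM_LEN],
                 int ilocal[SIM_NODES][SIM_LEN],
                 const int ndata[SIM_NODES], int add)
{
    int p, k;

    for (p = 0; p < SIM_NODES; p++) {
        assert(ndata[p] >= 0 && ndata[p] <= SIM_LEN);
        for (k = 0; k < ndata[p]; k++) {
            int dst  = iproc[p][k];
            int elem = ilocal[p][k] - 1;   /* convert to a 0-based index */

            assert(dst >= 0 && dst < SIM_NODES);
            assert(elem >= 0 && elem < SIM_LEN);
            if (add)
                aloc[dst][elem] += buffer[p][k];
            else
                aloc[dst][elem] = buffer[p][k];
        }
    }
}
```

With every element of aloc set to 10.0 and the buffers of the example, the
overwriting form yields 666.66, 10.0, 10.0 on node 0 and 444.44, 555.55, 777.77 on
node 1.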


4.4  subroutine  PREFIXfscatter_FUNC() 

PREFIX  can  be  d  (double  precision),  i  (integer)  ,  f  (real)  or  c  (character).  For  more 
information refer to Section 3.6.


Synopsis 

subroutine PREFIXfscatter_FUNC(ischedinfo,buffer,aloc)

Parameter  Declarations 

integer  ischedinfo  refers  to  the  relevant  schedule. 

TYPE  buffer()  points  to  data  values  that  will  form  operands  for  the  specified 
type  of  remote  operation. 

TYPE  aloc()  points  to  first  memory  location  on  calling  processor  to  be  used  as 
targets  of  remote  operations. 

Return  Value 

None 

Example

We assume that schedule has already been called with the parameters presented
in Section 4.1. Our example will assume that we wish to scatter and add double
precision numbers, i.e. that we will be calling dfscatter_add. On each processor,
aloc is the array to which values are to be scattered and added, and buffer
is the location from which the values to be scattered and added will be
obtained. The processor and local array index to which the values are to be scattered
and added were designated during an earlier call to schedule.


      double precision buffer(2), aloc(3)
      integer ischedinfo

      do 10 i=1,3
         aloc(i) = 10.0
 10   continue

      if (mynode().eq.0) then
         buffer(1) = 444.44
         buffer(2) = 555.55
      endif

      if (mynode().eq.1) then
         buffer(1) = 666.66
         buffer(2) = 777.77
      endif

      call dfscatter_add(ischedinfo,buffer,aloc)


On processor 0, the first three elements of aloc are 676.66, 10.0 and 10.0. On
processor 1, the first three elements of aloc are 454.44, 565.55 and 787.77.


4.5 function ifbuild_translation_table()

For  detailed  information  refer  to  Section  3.7. 

Synopsis 

function  ifbuild_translation_table(part,indexarray, ndata) 

Parameter  Declarations 

integer  part  how  translation  table  will  be  mapped  -  may  be  BLOCKED  or 
STRIPED 

integer  indexarray()  each  processor  P  specifies  list  of  globally  numbered  indices 
for  which  P  will  be  responsible 

integer ndata number of indices for which processor P will be responsible

Return Value

Integer which refers to the translation table corresponding to the input data.

Example

An example demonstrating the use of both build_translation_table and dereference
can be found in Section 4.7.

4.6  subroutine  flocalize() 

For  more  information  refer  to  Section  3.9 

Synopsis 

subroutine flocalize(itabptr,ilsched,iglobal_refs,ilocal_refs,ndata,n_off_proc,my_size)

Parameter Declarations

integer itabptr refers to the relevant translation table pointer.

integer ilsched refers to the relevant schedule pointer (returned by localize).

integer iglobal_refs() the array which stores the global reference string.

integer ilocal_refs() the array which stores the local reference string corresponding
to the global references (returned by localize).

integer ndata number of global references.

integer n_off_proc number of off-processor data elements (returned by localize).

integer my_size the size of my local array.


Return  Value 


None 

Example 

Nodes 0 and 1 take part in a computation which involves a loop that refers to
data residing off processor. The inspector and executor code is presented here.


      integer i, ndata, indirection
      integer ilocal(5), iglobal_ref(5), ilocal_ref(5)
      double precision x(5), y(10)
      integer itabptr
      integer ischedptr
      integer BLOCKED, mysize, n_off_proc
      integer ifbuild_translation_table

c     the following is the inspector code
      BLOCKED = 1

      if (mynode().eq.0) then
         ilocal(1) = 1
         ilocal(2) = 2
         ilocal(3) = 3
         ndata = 3
         mysize = 3

         itabptr = ifbuild_translation_table(BLOCKED,ilocal,ndata)

         iglobal_ref(1) = 4
         iglobal_ref(2) = 8
         iglobal_ref(3) = 2

         call flocalize(itabptr,ischedptr,iglobal_ref,
     &                  ilocal_ref,ndata,n_off_proc,mysize)
      else
         ilocal(1) = 4
         ilocal(2) = 5
         ilocal(3) = 6
         ilocal(4) = 7
         ilocal(5) = 8
         ndata = 5
         mysize = 5

         itabptr = ifbuild_translation_table(BLOCKED,ilocal,ndata)

         iglobal_ref(1) = 5
         iglobal_ref(2) = 3
         iglobal_ref(3) = 4
         iglobal_ref(4) = 1
         iglobal_ref(5) = 7

         call flocalize(itabptr,ischedptr,iglobal_ref,
     &                  ilocal_ref,ndata,n_off_proc,mysize)
      endif
c
      do 10 i=1,ndata
         iglobal_ref(i) = ilocal_ref(i)
 10   continue

c     end of the inspector. Let us assign values to
c     the distributed arrays; y holds twice the global
c     index of each element

      do 20 i=1,ndata
         x(i) = i
         y(i) = 2*ilocal(i)
 20   continue

c     the following is the executor code

      call dfgather(ischedptr,y(ndata+1),y(1))

      do 30 i=1,ndata
         indirection = iglobal_ref(i)
         x(i) = x(i) + 3 * y(indirection)
 30   continue

c     end of the executor code


After the end of the computation, on processor 0 the values of x(1), x(2) and x(3) are
25.0, 50.0 and 15.0 respectively. On processor 1 the values of x(1), x(2), x(3), x(4)
and x(5) are 31.0, 20.0, 27.0, 10.0 and 47.0 respectively. For a detailed example in
FORTRAN refer to Appendix A.


4.7  subroutine  fdereference() 

For  more  information  about  this  section  refer  to  Section  3.8. 

Synopsis 

subroutine fdereference(index_table,global,local,proc,ndata)

Parameter declarations

integer index_table refers to the relevant translation table

integer  global ()  list  of  global  indices  we  wish  to  locate  in  distributed  memory 

integer  local()  local  indices  obtained  from  the  distributed  translation  table  that 
correspond  to  the  global  indices  passed  to  dereference 

integer  proc()  array  of  distributed  translation  table  processor  assignments  for 
each  global  index  passed  to  dereference 

integer ndata number of elements to be dereferenced

Table 2: Values obtained by dereference

Processor   proc(1)   local(1)   proc(2)   local(2)
    0          0         1          1         1
    1          1         2          0         2

Return value

None

Example

A  one  dimensional  distributed  array  is  partitioned  in  some  irregular  manner  so  we 
need  a  distributed  translation  table  to  keep  track  of  where  one  can  find  the  value 
of  a  given  element  of  the  distributed  array. 

In the example below, we initialize a translation table. Processor 0 calls
build_translation_table and assigns indices 1 and 4 to processor 0; processor 1 calls
build_translation_table and assigns indices 2 and 3 to processor 1. The translation
table is partitioned between processors in blocks.

Processor 0 then uses the translation table to dereference global variables 1 and 2;
processor 1 uses the translation table to dereference global variables 3 and 4. On
each processor, dereference carries out a translation table lookup. The values of
proc and local returned by dereference are shown in Table 2. The user specifies
the processor to which each global index is assigned; note, however, that
build_translation_table assigns the local indices.


      program dref
      integer size, i, index_array(2)
      integer ideref_array(2)
      integer ilocal(2), iproc(2)
      integer BLOCKED, itable
      integer ifbuild_translation_table

c     Assign indices 1 and 4 to processor 0

      if (mynode().eq.0) then
         index_array(1) = 1
         index_array(2) = 4
      endif

c     Assign indices 2 and 3 to processor 1

      if (mynode().eq.1) then
         index_array(1) = 2
         index_array(2) = 3
      endif

c     set up a translation table

      BLOCKED = 1
      size = 2

      itable = ifbuild_translation_table(BLOCKED,index_array,size)

c     Processor 0 seeks processor and local indices
c     for global array indices 1 and 2

      if (mynode().eq.0) then
         ideref_array(1) = 1
         ideref_array(2) = 2
      endif

c     Processor 1 seeks processor and local indices
c     for global array indices 3 and 4

      if (mynode().eq.1) then
         ideref_array(1) = 3
         ideref_array(2) = 4
      endif

c     Dereference a set of global variables

      call fdereference(itable,ideref_array,ilocal,iproc,size)

c     ilocal and iproc return the processors and local indices where
c     global array indices are stored (see Table 2).

c     In processor 0, iproc(1) = 0, iproc(2) = 1, ilocal(1) = 1, ilocal(2) = 1
c     In processor 1, iproc(1) = 1, iproc(2) = 0, ilocal(1) = 2, ilocal(2) = 2
      stop
      end


Now assume that processor 0 needs to know the values of distributed array elements
1, 2 and 4, while processor 1 needs to know the value of element 3. We call
dereference to find the processors and the local indices that correspond to each global
index. At this point schedule can be called, and gathers and scatters can be carried out.


A Sweep over the Edges of an Unstructured Mesh

This code can be found in the directory examples/unst. It goes through the whole
process of setting up the inspector, and then the subroutine executor is called to do the
actual computation. There is a driver program which is included in the distribution
but not listed in this section. The executor is a loop taken from
a real CFD code, where the loop is over the edges of the mesh. In the subroutine
executor, if we remove the calls to gather and scatter_add, the code looks
identical to the sequential version.


c-----------------------------------------------------------------
c     The subroutines inspector and executor for a sweep over an
c     arbitrary unstructured mesh are shown below.
c
c     There is a driver code which calls these two subroutines after
c     reading in the mesh structure and initialization data. This
c     shows how the different PARTI primitives can be called
c     from FORTRAN.
c-----------------------------------------------------------------

c-----------------------------------------------------------------
      subroutine inspector(ledge,myvals,nde)
c-----------------------------------------------------------------
c
#include "commonl.F"
c
      common/node/ ntotnodes, nonode, noedge
      common/sched/ lesched
      common/offproc/ ne_off_proc
c
      integer nde(ledge,2)
      integer myvals(nonode)
c
c-----Local variables
c
      integer ig_ref_e(nge)
      integer locale(nge)
      integer ifbuild_translation_table
c
c-----Build the translation table
c
      itabptr = ifbuild_translation_table(1,myvals,nonode)
c
c-----Set up global references for edge loop
c
      do 20 i = 1, noedge
         ig_ref_e(i) = nde(i,1)
         ig_ref_e(noedge+i) = nde(i,2)
 20   continue
      iecount = 2 * noedge
c
c-----Set up schedule and change global ref. to local ref.
c
      call flocalize(itabptr,lesched,ig_ref_e,locale,
     &               iecount,ne_off_proc,nonode)
c
      do 40 i = 1, noedge
         nde(i,1) = locale(i)
         nde(i,2) = locale(noedge+i)
 40   continue
c
      return
      end

c-----------------------------------------------------------------
      subroutine executor(ledge,lnode,nde,gnorm,w,p,dtl,iflop)
c-----------------------------------------------------------------
c
      real*8 rm,al,yaw,gamma,rho0,p0,ei0,h0,c0,u0,v0,w0
      real*8 cfl,bc,vis0,vis1,vis2,hm,smoop
c
      common/node/ ntotnodes, nonode, noedge
      common/sched/ lesched
      common/offproc/ ne_off_proc
      common/tsp/ cfl,bc,vis0,vis1,vis2,hm,smoop,mcycs3n
      common/flg/ rm,al,yaw,gamma,rho0,p0,ei0,h0,c0,u0,v0,w0
c
      integer nde(ledge,2)
      real*8 gnorm(ledge,5)
      real*8 dtl(lnode)
      real*8 w(lnode,5),p(lnode)
c
c-----Local variables
c
      real*8 cc1,cc2,cs1,cs2,a1,a2,qs,flux1,flux2
c
c-----Initialize time step
c
      do 50 i=1,nonode
         dtl(i) = 0.0D0
 50   continue
c
c-----Do all the gathers
c
      do 60 kk = 1,4
         call dfgather(lesched,w(nonode+1,kk),w(1,kk))
 60   continue
      call dfgather(lesched,p(nonode+1),p(1))
      do 63 i = 1,ne_off_proc
         dtl(nonode+i) = 0.0D0
 63   continue
c
c-----Compute field time-steps using edge format
c
      do 500 i=1,noedge
         n1 = nde(i,1)
         n2 = nde(i,2)
         cc1 = dsqrt(gamma*p(n1)/w(n1,1))
         cc2 = dsqrt(gamma*p(n2)/w(n2,1))
         cs1 = cc1*gnorm(i,4)
         cs2 = cc2*gnorm(i,5)
         a1 = (gnorm(i,1)*w(n1,2) + gnorm(i,2)*w(n1,3)
     &       + gnorm(i,3)*w(n1,4)) / w(n1,1)
         a2 = (gnorm(i,1)*w(n2,2) + gnorm(i,2)*w(n2,3)
     &       + gnorm(i,3)*w(n2,4)) / w(n2,1)
         qs = (a1 + a2) / 2.0D0
         flux1 = dabs(qs) + cs1
         flux2 = dabs(qs) + cs2
         dtl(n1) = dtl(n1) + flux2
         dtl(n2) = dtl(n2) + flux1
 500  continue
      iflop = iflop + (noedge * 28)
c
c-----Do all the scatters
c
      call dfscatter_add(lesched,dtl(nonode+1),dtl(1))
c
      return
      end


B  Example  :  Sparse  matrix  multiplication 

The following example of symmetric matrix-vector multiplication can be found in the
file matmult.c in the examples/sparse_mat_mult directory. There is a host program
which is present in the same directory but has not been listed here. The sparse matrix
is obtained from the host program using the function get_sparse_mat(). Then we go
through the pre-processing to generate all the fetch lists and build a schedule to bring
in off-processor data. Lastly, the matrix multiplication procedure spmvm() is called.
After the multiplication the values are scattered using the primitive scatter_add.
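
Before reading the node program it may help to see the arithmetic kernel on its own.
The sketch below is a sequential version with no communication; sym_spmv and its
parameter names are hypothetical, and it assumes the storage layout suggested by the
listing: the diagonal is kept separately, and each row stores off-diagonal entries from
one triangle only, so every stored entry contributes to two elements of the result. In
the node program the second contribution is what has to be sent to another processor
with scatter_add when the column is not local.

```c
#include <assert.h>

/* Hypothetical sequential symmetric sparse matrix-vector product y = A*x.
   diags[i] is A(i,i). Row i owns ncols[i] consecutive entries of
   cols[]/vals[]; each entry A(i,c) also stands for A(c,i). */
void sym_spmv(int nrows, const double *diags, const int *ncols,
              const int *cols, const double *vals,
              const double *x, double *y)
{
    int i, j, count = 0;

    assert(nrows >= 0);
    for (i = 0; i < nrows; i++)
        y[i] = 0.0;

    for (i = 0; i < nrows; i++) {
        y[i] += diags[i] * x[i];
        for (j = 0; j < ncols[i]; j++) {
            int c = cols[count];

            assert(c >= 0 && c < nrows);
            y[i] += vals[count] * x[c];   /* contribution of A(i,c) */
            y[c] += vals[count] * x[i];   /* mirrored term A(c,i)   */
            count++;
        }
    }
}
```

For the 3 by 3 matrix with rows (2 1 0), (1 3 4), (0 4 5), stored as diags = {2, 3, 5},
ncols = {1, 1, 0}, cols = {1, 2} and vals = {1, 4}, the product with x = (1, 2, 3) is
(4, 19, 23).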

/*************************************************************/
/* PARTI program to do a sparse matrix-vector multiplication */
/*                                                           */
/* This program reads in a sparse matrix with the help of    */
/* the host program and does a matrix vector multiplication. */
/* This is a listing of the node program and it is run by the */
/* host program. This program:                               */
/*                                                           */
/* 1) gets unstructured mesh (w/ help from host program)     */
/* 2) does lots of memory and address stuff on it            */
/* 3) generates a vector x                                   */
/* 4) multiplies x by the matrix, getting y                  */
/*                                                           */

#include <cube.h>
#include <stdio.h>
#include <math.h>
#include "parti.h"
#include "main.h"

main(argc,argv)
int argc;
char *argv[];
{
    int i, j, count;
    TTABLE *table;
    SCHED *sr;
    double *x, *y, *z;

    /*
     * Get sparse matrix from host program.
     */
    get_sparse_mat();

    /*
     * Build translation table by scattering Row to the table.
     * IN: Row[i] OUT: table
     */
    table = build_translation_table(BLOCKED,Row,Myrows);

    /*
     * Look up address of Cols and put them in Local and Proc.
     * IN: Cols[i],table OUT: Local[i],Proc[i]
     */
    dereference(table,Cols,Local,Proc,Mynonzeros);

    /*
     * Loop through all proc/offset pairs and decide which
     * must be fetched from other processors.
     * IN: Local[i],Proc[i] OUT: Fetch_l[i],Fetch_p[i]
     */
    gen_fetch_list();

    /*
     * Allocate memory for vectors, and set x[i] = 1.0 for local i.
     */
    x = (double *) malloc(sizeof(double)*Myrows);
    y = (double *) malloc(sizeof(double)*Myrows);

    for(i=0;i<Myrows;i++) x[i] = 1.0;

    /*
     * build communications schedule
     * IN: Fetch_l[i],Fetch_p[i] OUT: sr
     */
    sr = schedule(Fetch_l,Fetch_p,Nfetch);

    /*
     * Perform sparse-matrix vector multiplication.
     */
    spmvm(sr,x,y);
}

/* END OF NODE PROGRAM */

/*
 * This function is used to read in the sparse matrix.
 * It should be ignored if at all possible.
 */
get_sparse_mat()
{
    int size, indx_buffer[BUFFER_SIZE];
    double coef_buffer[BUFFER_SIZE];
    int type, rows_expected;

    rows_expected = -1;
    Myrows = 0;
    Mynonzeros = 0;

    gsync();

    while( (Myrows<rows_expected) || (rows_expected<0) ){
        cprobe(-1);
        type = infotype();
        size = infocount()/sizeof(int);
        if( type==ROW_INDX_MSG ){
            crecv(ROW_INDX_MSG,indx_buffer,size*sizeof(int));
            crecv(ROW_COEF_MSG,coef_buffer,size*sizeof(double));
            unpack_row_data(indx_buffer,coef_buffer,size);
        }
        if( type==SETUP_MSG ){
            crecv(SETUP_MSG,indx_buffer,size*sizeof(int));
            rows_expected = indx_buffer[mynode()];
            Nrows = indx_buffer[numnodes()];
        }
    }

    gsync();
}

/*
 * The buffers are unpacked in the following
 * procedure
 */
unpack_row_data(indx_buffer,coef_buffer,size)
int *indx_buffer,size;
double *coef_buffer;
{
    int count, i, j, row, ncols, count2, ixx, ist;
    double sum;
    static int col_count = 0;

    for( count=0; count<size; ){
        /* each row starts with its global index and diagonal value */
        Row[Myrows] = indx_buffer[count];
        Diags[Myrows] = coef_buffer[count];
        sum = Diags[Myrows];

        /* followed by the number of off-diagonal entries */
        ncols = Ncols[Myrows] = indx_buffer[count+1];
        count = count+2;
        Mynonzeros += ncols;

        if( Myrows >= MAX_ROWS ){
            fprintf(stderr,"Error on node %d : too many rows!!!\n",mynode());
            exit(1);
        }

        if( Mynonzeros >= MAX_NONZEROS ){
            fprintf(stderr,"Error on node %d : too many nonzeros!!!\n",
                    mynode());
            exit(1);
        }

        /* and then the column indices and values themselves */
        for( j=0; j<ncols; j++ ){
            Cols[col_count] = indx_buffer[count];
            Vals[col_count] = coef_buffer[count];
            sum += Vals[col_count];
            col_count++;
            count++;
        }

        Myrows++;
    }
}

/*
 * This function takes the Local[i],Proc[i]
 * address for each nonzero col in the matrix
 * and puts nonlocal ones into Fetch_l[i],Fetch_p[i]
 */
gen_fetch_list()
{
    int count, i, myproc;

    myproc = mynode();

    /* count off node refs. */
    Nfetch = 0;
    for( i=0; i<Mynonzeros; i++ ) Nfetch += (Proc[i]!=myproc);

    /* one allocation holds both lists: Fetch_p first, then Fetch_l */
    Fetch_p = (int *) malloc(sizeof(int)*Nfetch*2);
    Fetch_l = &Fetch_p[Nfetch];
    count = 0;

    /* for each ref. */
    for( i=0; i<Mynonzeros; i++ ){
        if( Proc[i] != myproc ){
            /* if Col[i] refers to an off-proc location.. */
            Fetch_p[count] = Proc[i];   /* add it to the fetch list */
            Fetch_l[count] = Local[i];
            count++;
        }
    }
}

/*
 * sparse matrix vector multiply function !
 * require that the schedule be built and passed in
 */
spmvm(sr,x,y)
SCHED *sr;        /* communication schedule */
double *x, *y;    /* input and result vectors */
{
    int myproc, bcount, count, i, j;
    double tmp, *buffer, *ybuffer;

    /* Allocate local buffer to gather data into. */
    buffer = (double *) malloc(sizeof(double)*Nfetch);

    /* Allocate local buffer to store output vector values into. */
    ybuffer = (double *) malloc(sizeof(double)*Nfetch);

    /* Gather data using previously computed communication schedule. */
    dgather(sr,buffer,x);

    myproc = mynode();
    bcount = 0;
    count = 0;

    for( i=0; i<Myrows; i++ ) y[i]=0.0;
    for( i=0; i<Nfetch; i++ ) ybuffer[i]=0.0;

    for( i=0; i<Myrows; i++ ){
        y[i] += Diags[i]*x[i];
        for( j=0; j<Ncols[i]; j++ ){
            /* for each nonzero col .... */
            if( Proc[count] == myproc ){
                /* if col[count] is local */
                y[i] += x[Local[count]]*Vals[count];
                y[Local[count]] += x[i]*Vals[count];
            } else {
                /* otherwise look in buffer */
                y[i] += buffer[bcount]*Vals[count];
                ybuffer[bcount] += x[i]*Vals[count];
                bcount++;
            }
            count++;
        }
    }

    /* Send the off-processor contributions to their owners. */
    dscatter_add(sr,ybuffer,y);
    gsync();

    for( i=0; i<Myrows; i++ ){
        fprintf(myfile," after scatter processor %d, y[%d] = %lf\n",
                myproc,i,y[i]);
        fflush(myfile);
    }

    free(buffer);
    free(ybuffer);
}